Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


5.3.7.5 Normalization

After filtering out the low-expressed genes, we can normalize the count data. edgeR uses
the trimmed mean of M-values (TMM) method to compute a normalization factor for each
sample that corrects sample-specific biases. Without normalization, if only a few genes
are highly expressed in a sample, those genes will account for a substantial proportion of
that sample's library size, causing the remaining genes to appear under-represented. The
normalization factor is multiplied by the library size to yield the effective library size,
which is used for normalization. The following function calculates the TMM normalization
factors:

yNorm <- calcNormFactors(y)

Notice, as shown in Figure 5.11, that the normalization factor of each sample has changed from its initial value of 1.
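
To make the relationship between library size, normalization factor, and effective library size concrete, here is a minimal sketch that is not part of the original analysis. It assumes the edgeR package is installed and uses a simulated toy count matrix. The object names counts, group, and effLibSize, and all simulated values, are hypothetical and chosen only for illustration. A few genes are deliberately inflated in one sample to mimic the composition bias described above.

```r
library(edgeR)

# Hypothetical toy data: 1000 genes x 4 samples, simulated for illustration only.
set.seed(1)
counts <- matrix(rnbinom(4000, mu = 100, size = 10), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("S", 1:4)))

# Inflate ten genes in sample S4 so that they dominate its library size.
counts[1:10, "S4"] <- counts[1:10, "S4"] * 200

# Assumed two-group design, two replicates per group.
group <- factor(c("A", "A", "B", "B"))
y <- DGEList(counts = counts, group = group)

# TMM is the default method of calcNormFactors().
yNorm <- calcNormFactors(y)

# Effective library size = library size x normalization factor.
effLibSize <- yNorm$samples$lib.size * yNorm$samples$norm.factors
print(cbind(yNorm$samples, eff.lib.size = effLibSize))
```

In a setup like this, the sample with the inflated genes should receive a normalization factor below 1, so its effective library size is scaled down relative to its raw library size. The factors are also scaled so that their product across samples equals 1.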

5.3.7.6 Estimating Dispersions

The next step is to use the normalized count data above to estimate the dispersions, which
will be used to estimate the parameters of the negative binomial model as discussed above.
Because there are only a few replicates or samples, estimating each gene-wise dispersion
solely from that gene's count vector across replicates will not be accurate. edgeR shares
information between genes to estimate dispersion; genes of closely similar abundance will

FIGURE 5.10 DGEList object after filtering out genes with low gene expression.